Fix/eval scheduler actor wedge by ekirimlioglu · Pull Request #238 · RapidFireAI/rapidfireai

ekirimlioglu · 2026-05-01T05:33:32Z

Changes

Fixes a deterministic wedge in Controller.run_multi_pipeline_inference that caused multi-pipeline inference to hang at 0% GPU/CPU. Two independent triggers were closed:
a. An exception during the batch-submission/bookkeeping block left the actor marked busy by PipelineScheduler.schedule() with no path to free it. The block is now wrapped in a try/except that mirrors the existing init-failure handler — marks the pipeline FAILED, drops the partial
active_tasks entry, and calls scheduler.remove_pipeline so the actor is released.
b. Actor death after dispatch (OOM, segfault) left orphaned futures and the controller's busy-branch was a blind time.sleep(0.5). Replaced with ray.wait(all_pending, num_returns=1, timeout=0.5) so failed futures surface and the existing reap path frees the actor on the next iteration.
No API or behavior changes for healthy runs. No breaking changes.

Changelog Content

Additions

Unit tests covering PipelineScheduler busy/free bookkeeping invariants, including a regression test for the actor-leak wedge (tests/test_pipeline_scheduler.py).

Changes

Busy-loop in evals controller now blocks on ray.wait instead of time.sleep, so dead-actor futures are reaped promptly.

Fixes

Prevent indefinite hang in evals multi-pipeline inference when batch dispatch raises synchronously or an actor dies after dispatch.

Testing

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this change manually
I have tested this change in the following environments:
- Local development
- Docker environment
- Other: 10-config grid run that previously wedged on a multi-GPU host

Screenshots (if applicable)

N/A — backend-only change.

Checklist

My code follows the project's style guidelines
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (no user-facing docs affected)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published

Testing

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this change manually
I have tested this change in the following environments:
- Local development
- Docker environment
- Other: _______________

Screenshots (if applicable)

Add screenshots to help explain your changes.

Checklist

My code follows the project's style guidelines
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published

Performance Impact

If this PR affects performance, describe the impact and any optimizations made.

Related Issues

Fixes #(issue number)
Closes #(issue number)
Related to #(issue number)

Note

Medium Risk
Touches the multi-pipeline scheduling loop and Ray future waiting semantics; mistakes could cause new hangs or premature failures under load.

Overview
Fixes a deterministic hang in Controller.run_multi_pipeline_inference by (1) wrapping batch dispatch/bookkeeping in a try/except that marks the pipeline FAILED and calls scheduler.remove_pipeline() so the actor is freed, and (2) replacing the busy-loop time.sleep() with a ray.wait() on not-yet-ready futures so dead-actor failures surface promptly.

Adds tests/test_pipeline_scheduler.py to codify scheduler bookkeeping invariants (busy actors must be freed via set_completed_task or remove_pipeline) and to cover the regression wedge scenario.

^{Reviewed by Cursor Bugbot for commit 1d33b88. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 6e7e220. Configure here.}

arun-rfai · 2026-05-02T04:17:11Z

@ekirimlioglu Thanks for the detailed diagnosis and patch! I verified that both failure modes are real and your fix follows the existing init-failure handler pattern, which is the right shape.

Before the RapidFire AI team can proceed with reviewing and merging this, please open a GitHub Issue first describing the bugs, following the repo's Issue conventions. Specifically:

Repro steps (minimal sequence to trigger the wedge on main)
Minimal set of configs that reliably reproduces it
Machine environment (OS, Python version, Ray version, hardware — CPU/GPU, cores, memory)
Workload characteristics (which model API calls, which embedding model, batch sizes, num actors, etc.)
Observed symptoms (logs, stdout, how you confirmed the wedge vs. slow progress)
Whether the wedge is deterministic or probabilistic on your setup
Whether your patched branch resolved the wedge end-to-end on the original workload, or only addresses related failure modes

Then please reference that Issue in this PR as the one being resolved. This will help the QA team reproduce the bugs independently and validate the fix. It will also keep the project's bug-tracking conventions consistent.

Overall, thanks for the great contribution. First-time PRs of this depth are rare!

ekirimlioglu · 2026-05-04T03:11:48Z

Filed #239 with the repro, environment, and end-to-end validation notes.

Small suggestion: CONTRIBUTING.md doesn't currently mention the file-an-issue-first step. Adding a short note under "Submitting a pull request" could help future first-time contributors.

anay-rfai

@ekirimlioglu Good catch and great work on the fix.

controller.py has gone through significant changes on main since you forked, most notably #237 (api gateway) and #243 (multimodal RAG). Could you update your branch against latest upstream/main before we merge?

Update your fork either with git merge or rebase against upstream/main — your call, both are fine. We'll squash-merge at the end.
Heads-up: the auto-merge will likely succeed without conflict, but please visually verify that your dispatch try/except still wraps the actor.process_batch.remote(...) loop and the active_tasks[actor_id] = {...} assignment in the new code shape. The surrounding regions were restructured.
Re-run tests against the updated branch to confirm the wedge repro still passes.

Once that's pushed, this is good to approve and a maintainer will squash-merge. Thanks!

Two changes to evals/scheduling/controller.run_multi_pipeline_inference that close a deterministic hang we hit on a 10-config grid run. Symptom: MainThread parks in `time.sleep(0.5)` at the "Check if all actors busy" branch while Ray reports 0% GPU/CPU usage. py-spy stack shows the loop spinning; the bookkeeping says actors are busy but Ray says they are idle. The hang is permanent. Root cause has two triggers: (1) Exception during the batch-submission block. The init step (`actor.initialize_for_pipeline.remote(...)`) is already wrapped by an existing try/except that calls `scheduler.remove_pipeline`, but the subsequent `actor.process_batch.remote(...)` loop, the `active_tasks[actor_id] = {...}` assignment, and the three `db.set_actor_task_*` writes are not. If any of these raise synchronously (serialization error, transient DB hiccup, etc.) the actor was already marked busy by `scheduler.schedule()` and leaks busy state; the controller never calls `set_completed_task` or `remove_pipeline` for it. (2) Actor death after dispatch. Once batches are submitted and tracked in `active_tasks[actor_id]["futures"]`, the only completion-reaping path is `ray.wait(futures, timeout=0)` -- a non-blocking poll. If the actor process dies (OOM-kill, segfault from a broken native dep, etc.) the futures become orphaned: Ray will eventually surface them as failed, but `ray.wait(timeout=0)` only catches what is already ready in this Python tick. The controller falls into the "all actors busy" branch, sleeps 0.5s with no health check, and repeats forever. Fixes: - Wrap the batch-submission + bookkeeping block in try/except that mirrors the existing init-failure handler: mark the pipeline FAILED, drop any partial active_tasks entry, call `scheduler.remove_pipeline` so the actor is freed, and continue. - Replace the bare `time.sleep(0.5)` in the busy-loop with `ray.wait(all_pending, num_returns=1, timeout=0.5)`. ray.wait surfaces both successful and failed futures (including RayActorError from dead actors), so the next iteration's reap path at the "Check for completed tasks" block frees the actor via the existing exception handler. Falls back to time.sleep when no futures are dispatched yet (initial loop entry). The two changes are independent and complementary: either alone fixes a subset of triggers; together they close the wedge.

Adds tests/test_pipeline_scheduler.py with 6 unit tests for the busy/free transitions in PipelineScheduler. They document the invariant the controller's scheduling loop relies on: an actor is marked busy by schedule() and MUST be freed by either set_completed_task() (success) or remove_pipeline() (failure). If the controller dispatch path fails without calling one of those, the actor leaks busy state and the loop wedges. Also adds tools/repro_scheduler_wedge.py, a Ray-free standalone reproduction of the bug. Scenario 1 simulates the buggy controller path (dispatch fails, no remove_pipeline call) and detects the wedge by counting consecutive "all actors busy" returns from schedule(). Scenario 2 demonstrates recovery via remove_pipeline. Useful for maintainers to confirm the bookkeeping contract without spinning up the full evals stack. Existing tests on this branch: pytest tests/test_pipeline_scheduler.py -> 6 passed python tools/repro_scheduler_wedge.py -> exits 0, prints "BUG REPRODUCED + FIX VERIFIED"

Tighten the new comments in controller.py, drop narrative docstrings from test_pipeline_scheduler.py, and simplify repro_scheduler_wedge.py output. No behavior change.

Drop the standalone tools/repro_scheduler_wedge.py and add the wedge scenario as a regression test in test_pipeline_scheduler.py. The unit test encodes the same invariant deterministically and runs in CI, so the standalone script no longer pulls its weight.

…ompletion The previous busy-branch passed every in-flight batch future to ray.wait(num_returns=1, timeout=0.5). ray.wait is a snapshot of readiness, not an event consumer -- once any single batch resolved (which happens early in a multi-batch task), that already-ready ref permanently satisfied num_returns=1 and the 0.5s timeout never engaged. Combined with the reap path only draining when all of an actor's futures are ready, the all-actors-busy branch hot-spun the IC poll and scheduler.schedule() at thousands of iterations per second instead of the original two per second. Partition first via a non-blocking ray.wait and only block on the not-yet-ready set. Dead-actor detection is preserved: a RayActorError-bearing future is "ready" to ray.wait, lands in the ready partition, and gets drained by the next iteration's reap path via the existing handler at controller.py:1312.

ekirimlioglu · 2026-05-19T05:01:47Z

@anay-rfai

Rebased. 5 commits replayed cleanly with no conflicts. Visually confirmed the dispatch try/except still wraps the process_batch.remote() loop, the active_tasks[actor_id] = {...} assignment, and the three db.set_actor_task_* writes (controller.py L1939-L1983 in the new shape). The except block still calls scheduler.remove_pipeline(pipeline_id) so the actor is freed on any synchronous failure during batch dispatch. The busy-loop ray.wait partition fix also survived at L1796-L1804.

Tests pass:

pytest tests/test_pipeline_scheduler.py -> 7 passed in 0.96s

The regression test (test_actor_leaks_busy_when_neither_completion_nor_removal_called) still shows 20/20 busy-sentinel returns when the controller-side cleanup is skipped, confirming the wedge condition is still reachable in PipelineScheduler and the controller patch is what prevents

anay-rfai

LGTM

ekirimlioglu marked this pull request as ready for review May 1, 2026 05:40

cursor Bot reviewed May 1, 2026

View reviewed changes

Comment thread rapidfireai/evals/scheduling/controller.py Outdated

ekirimlioglu mentioned this pull request May 4, 2026

[BUG] evals controller hangs in busy-loop when actor leaks busy state #239

Open

arun-rfai requested review from anay-rfai and humaira-rf May 14, 2026 23:37

anay-rfai requested changes May 15, 2026

View reviewed changes

ekirimlioglu added 5 commits May 18, 2026 21:46

chore(evals): trim verbose comments and prints in scheduler wedge work

8d93bd9

Tighten the new comments in controller.py, drop narrative docstrings from test_pipeline_scheduler.py, and simplify repro_scheduler_wedge.py output. No behavior change.

ekirimlioglu force-pushed the fix/eval-scheduler-actor-wedge branch from 5728903 to 1d33b88 Compare May 19, 2026 05:00

ekirimlioglu requested a review from anay-rfai May 24, 2026 18:41

anay-rfai approved these changes Jun 15, 2026

View reviewed changes

anay-rfai merged commit 13910a6 into RapidFireAI:main Jun 15, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/eval scheduler actor wedge#238

Fix/eval scheduler actor wedge#238
anay-rfai merged 5 commits into
RapidFireAI:mainfrom
ekirimlioglu:fix/eval-scheduler-actor-wedge

ekirimlioglu commented May 1, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

arun-rfai commented May 2, 2026

Uh oh!

ekirimlioglu commented May 4, 2026

Uh oh!

anay-rfai left a comment

Uh oh!

ekirimlioglu commented May 19, 2026

Uh oh!

anay-rfai left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ekirimlioglu commented May 1, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Testing

Screenshots (if applicable)

Checklist

Performance Impact

Related Issues

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

arun-rfai commented May 2, 2026

Uh oh!

ekirimlioglu commented May 4, 2026

Uh oh!

anay-rfai left a comment

Choose a reason for hiding this comment

Uh oh!

ekirimlioglu commented May 19, 2026

Uh oh!

anay-rfai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ekirimlioglu commented May 1, 2026 •

edited by cursor Bot

Loading